Bioterrorism has created a need for the rapid analysis of samples that may contain toxins, viruses, or other deadly agents. Mass spectrometry (MS) is a proteomics tool for accurate and comprehensive profiling of proteins. An automated software tool that searches for matches against a proteome database is useful for forensic analysis of samples and is an effective countermeasure to bioterrorist attacks. A software tool called MARLOWE was tested and worked well, but it failed to identify targets of interest, such as the toxin abrin, that were missing from the KEGG.JP (Kanehisa et al. 2002) database on which it relies. FTP access to KEGG.JP is cost-prohibitive. Here we create a database for MARLOWE from the public UniProt.org proteome database (Consortium 2020). By creating a process to update this MARLOWE database, we can ensure that target organisms are present and identified correctly.
MARLOWE was installed on a 32-core Ubuntu Linux server with 500 GB of RAM, with MySQL 8.0.31 hosting the UniProt candidate database. The MARLOWE packages were modified to run correctly on R version 4.2.1 using the RStudio IDE.
MARLOWE was evaluated with both the KEGG version and the UniProt version of the database on 8 data files from biological samples including fish, milk, oyster, juice, and castor bean. MARLOWE was run on data that had been processed with PEAKS DeNovo to determine the peptides contained in the samples. The organisms identified in each MARLOWE run were compared with the actual contents of the sample to assess performance and identify needed improvements.
The parse_fasta.R package was developed to read UniProt proteome FASTA files. The UniProt FASTA format contains the minimum fields that MARLOWE requires, but it must be parsed differently because it differs substantially from the KEGG format. UniProt uses two different FASTA header formats, one for UniRef and one for UniProtKB. The parse_fasta R function examines the file to determine which header format is in use and applies the appropriate parsing via regular expressions.
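The header detection described above can be illustrated with a short sketch. It is written in Python for brevity and is not the parse_fasta R code. The regular expressions follow the published UniProtKB and UniRef FASTA header layouts, while the function name `parse_header`, the dictionary keys, and the example header lines are assumptions made for illustration.

```python
import re

# UniProtKB layout: >db|Accession|EntryName ProteinName OS=Organism OX=TaxonID [GN=Gene] PE=n SV=n
UNIPROTKB_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<taxon_id>\d+)"
)

# UniRef layout: >UniRefNN_ID ClusterName n=Members Tax=TaxonName TaxID=TaxonID RepID=Representative
UNIREF_RE = re.compile(
    r"^>(?P<cluster_id>UniRef(?:100|90|50)_\S+)\s+(?P<protein_name>.*?)\s+"
    r"n=(?P<members>\d+)\s+Tax=(?P<organism>.*?)\s+TaxID=(?P<taxon_id>\d+)"
)


def parse_header(line):
    """Detect the header format and return its fields as a dict of strings."""
    for fmt, pattern in (("UniProtKB", UNIPROTKB_RE), ("UniRef", UNIREF_RE)):
        match = pattern.match(line)
        if match:
            return {"format": fmt, **match.groupdict()}
    raise ValueError(f"Unrecognized FASTA header: {line!r}")


# Illustrative header lines (hypothetical examples, not taken from the project data).
kb = parse_header(">sp|P02879|RICI_RICCO Ricin OS=Ricinus communis OX=3988 PE=1 SV=1")
ref = parse_header(">UniRef90_P02879 Ricin n=5 Tax=Ricinus communis TaxID=3988 RepID=RICI_RICCO")
print(kb["format"], kb["organism"], kb["taxon_id"])
print(ref["format"], ref["organism"], ref["taxon_id"])
```

Trying each pattern in turn keeps the format decision in one place, so a further header variant would only need one more entry in the loop.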
UniProt identifies organisms by NCBI taxon ID (the OX field), whereas the KEGG database uses the unique kegg_id identifier. For organisms downloaded from UniProt, the integer taxon ID prefixed with “U” replaces kegg_id as the key index field. The KEGG .ent files provide more fields than are available in the UniProt FASTA header, so several fields are set to NULL for the UniProt organism data: Pathways, Database Links, Module, Brite, Position, and Motif. These fields are not required by the MARLOWE algorithm.
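As a minimal sketch of this key mapping (in Python, for illustration only), the record below stores the “U”-prefixed taxon ID in the kegg_id field and leaves the KEGG-only fields empty. The function name and the lowercase field names are hypothetical and do not reflect the actual MARLOWE schema.

```python
# Fields that the text lists as present in KEGG .ent files but absent from UniProt headers.
# The lowercase names used here are hypothetical, not the real column names.
KEGG_ONLY_FIELDS = ("pathways", "database_links", "module", "brite", "position", "motif")


def uniprot_record(name, taxon_id):
    """Build a record keyed by 'U' + NCBI taxon ID, with KEGG-only fields set to None (NULL)."""
    record = {"kegg_id": f"U{int(taxon_id)}", "name": name, "taxon_id": int(taxon_id)}
    record.update({field: None for field in KEGG_ONLY_FIELDS})
    return record


castor = uniprot_record("Ricinus communis", "3988")
print(castor["kegg_id"])
```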
We built a minimal “candidate” database from UniProt proteome data for 9 organisms matching those contained in the test samples. Building the database involves downloading and parsing the FASTA proteome files, then inserting the organism identification into the database along with the amino acid sequences for all proteins and the peptides that result from digesting these proteins with trypsin. The final step is to upload NCBI taxonomy data for all organisms, which is used to produce the MARLOWE heat maps. The table below lists the contents of the MySQL database to verify that the organisms were inserted correctly, showing each organism with its protein count and the number of peptides produced by the in-silico digestion. The strong peptide count was also calculated and verified.
| name | taxon_id | protein_count | peptide_count |
|---|---|---|---|
| Bos taurus | 9913 | 23844 | 652649 |
| Citrus clementina | 85681 | 24934 | 586056 |
| Citrus sinensis | 2711 | 28128 | 572368 |
| Crassostrea gigas | 29159 | 25998 | 687216 |
| Crassostrea virginica | 6565 | 33719 | 876976 |
| Ricinus communis | 3988 | 31219 | 630447 |
| Pseudomonas fragi | 296 | 4324 | 85668 |
| Salvelinus namaycush | 8040 | 35973 | 696618 |
| Chlamydia pneumoniae | 83558 | 1052 | 23031 |
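The in-silico digestion behind the peptide counts can be sketched as follows. This Python illustration assumes the commonly cited trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages. The function name, the `min_length` parameter, and the toy sequence are assumptions, and the settings actually used to build the MARLOWE database may differ.

```python
import re


def trypsin_digest(sequence, min_length=1):
    """Split a protein sequence after K or R, except when the next residue is P."""
    # Zero-width split: the lookbehind finds the cleavage site, the lookahead blocks K/R-P bonds.
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    # Drop the empty trailing piece produced when the sequence ends in K or R,
    # along with any peptide shorter than min_length.
    return [p for p in peptides if len(p) >= max(min_length, 1)]


# Toy sequence for illustration only.
peptides = trypsin_digest("AKRPGKC")
print(peptides, len(peptides))
```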
Here is the MARLOWE output for a Ricinus communis lab sample. R. communis was correctly identified with a score of 420 using the UniProt database and 303 using the KEGG database.
[Figure: UniProt heat map, R. communis prep (castor bean)]
The MARLOWE score for each test sample is shown in Table 2. In the fish sample, the bacterium P. fragi was found, indicating spoilage. In the juice sample, two orange species were detected.
| Sample | UniProt | KEGG |
|---|---|---|
| R. communis 9 | 164 | 111 |
| R. communis GC4 | 654 | 476 |
| 55551-DeNovo | 420 | 303 |
| 555558-DeNovo | 2 | 0 |
| Fish-DeNovo | 330 | 16 |
| Juice-DeNovo | 130 | 423 |
| Milk-DeNovo | 432 | 109 |
| Oyster-DeNovo | 714 | 306 |
Alternate Digestion Enzymes: MARLOWE currently supports only trypsin digestion. We can construct another version of the database in which the proteins have been digested with an alternate protease.
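One way such an alternate-protease database could be generated is to drive the in-silico digestion from a table of cleavage rules. The Python sketch below uses simplified, commonly cited specificities (Lys-C cleaves after K; chymotrypsin cleaves after F, W, or Y unless followed by P). The rule table, the `digest` function, and the toy sequence are illustrative assumptions, not part of MARLOWE.

```python
import re

# Simplified cleavage rules expressed as zero-width regular expressions (assumed for illustration).
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",       # after K or R, not before P
    "lys-c": r"(?<=K)",                 # after K
    "chymotrypsin": r"(?<=[FWY])(?!P)", # after F, W, or Y, not before P
}


def digest(sequence, enzyme="trypsin"):
    """Digest a protein sequence with the named enzyme, assuming no missed cleavages."""
    rule = CLEAVAGE_RULES[enzyme.lower()]
    return [peptide for peptide in re.split(rule, sequence) if peptide]


# Toy sequence for illustration only.
for enzyme in CLEAVAGE_RULES:
    print(enzyme, digest("AKFRPWG", enzyme))
```

Keeping the rules in a lookup table means a new enzyme is added as data rather than as new code, which would suit building one database version per protease.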
Efficiency Improvements: Building the sample database with 9 organisms took about 24 hours. This process will need to be accelerated using parallel computing and multiple servers. Exploring faster algorithms may also lead to improvements.
User Interface Improvements: Converting the program to run in batch mode from the Linux bash shell would be more efficient and less error-prone than the current RMarkdown template run within RStudio. Another option is to run MARLOWE from a web server with a GUI. This would be possible by creating an R Shiny version in which the scientist selects input files through the GUI, after which the pipeline runs automatically and produces an output report that can be viewed and downloaded.
This project was a proof of concept to validate the parse_fasta package and the process for building a UniProt-sourced candidate database. It produced accurate and expected results with the small number of test cases used. Building a fully functional database will require 10,000-22,000 organisms. It is best to work toward this incrementally and evaluate the performance of each iteration.